Investigating Content and Construct Representation of a Common-item Design When Creating a Vertically Scaled Test
Abstract
According to equating guidelines, a set of common items should be a miniature version of the total test in terms of content and statistical representation (Kolen & Brennan, 2004). Differences between vertical scaling and equating suggest that these guidelines may not apply to vertical scaling in the same way that they apply to equating. This study investigated how well the guideline of content and construct representation was maintained while evaluating two stability assessment criteria (Robust z and the 0.3-logit difference). The results indicated that linking sets that were not fully representative of the full test forms produced different vertical scales than linking sets that were most representative of the full test forms. The results also showed that large disparities in the composition of linking sets produced statistically significant differences in the growth patterns of the resulting vertical scales, whereas small disparities in the composition of linking sets produced very similar vertical scales. Overall, the Robust z procedure was the more conservative approach to flagging unstable items.

Introduction

The common-item design (CID) is a data collection plan widely used in creating a vertical scale. Examinees' performance on the common items across test forms is used to indicate the amount of growth that occurs from grade to grade (Kolen & Brennan, 2004). Different decisions regarding the structure of the design and the composition of the linking item set may lead to different vertical scales (Camilli, Yamamoto, & Wang, 1993; Harris, 2007; Loyd & Hoover, 1980; Williams, Pommerich, & Thissen, 1998; Yen, 1986). The literature on test score equating provides some guidelines for constructing a test that includes common items as a method of collecting data (Kolen & Brennan, 2004).
According to these guidelines, the set of common items should be a miniature version of the total test in terms of content and statistical representation. Appropriately selecting common items for the linking set ensures that the common items sufficiently represent the total test. Potential common items are identified when adjacent test forms are constructed, but the common items that become part of the final linking set are those that remain reasonably stable in difficulty across forms. The equating literature also provides several criteria for screening common items, and different criteria may result in different sets of linking items. The research on equating has produced helpful guidelines for selecting and screening common items, yet the differences between vertical scaling and equating suggest that these guidelines may not apply to vertical scaling in the same way that they apply to equating. Through the equating process, examinees' location estimates are adjusted to account for differences in difficulty between the test forms and placed onto a common metric. In vertical scaling, the examinee groups that are administered the level tests are assumed to differ in ability, and the sets of test questions on the different forms are deliberately designed to assess different levels of achievement. In a review of the literature, Cook and Petersen (1987) concluded that when groups differ in level of ability, special care must be taken when selecting the set of common items for the anchor test. Content representativeness of the items is an important concern and can seriously affect conventional equating results (Cook, Eignor, & Taft, 1985; Klein & Jarjoura, 1985).
In the context of vertical scaling, because the examinee groups are expected to differ in their level of achievement and the test forms differ in difficulty, shifts in the construct and content specifications tested across test forms can occur simply by design. In equating, when common items are screened for stability, the item-difficulty estimates for common items are expected to be nearly identical because the two test forms are expected to be interchangeable. In vertical scaling, however, the item-difficulty estimates from two test forms across adjacent grades are expected to differ somewhat: common items should typically be easier for students at the higher grade level and more difficult for students at the lower grade level, and this should be reflected in the item-difficulty estimates.

Background and Rationale

Given the differences between equating and vertical scaling, this study investigated how well the guideline of content and construct representation was maintained in the context of creating a vertical scale while evaluating two stability assessment criteria. This was accomplished by analyzing data gathered using the CID illustrated in Figure 1. This CID is a variation of a design proposed by Sudweeks et al. (2008), and it involved constructing a separate test for each of two mathematical constructs (Geometry and Measurement) intended to assess achievement relative to objectives in the Utah Core Curriculum for several adjacent grade levels.

[Figure 1. The common-item design, showing the curricular grade level of items (G1 through G8) across test forms.]
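The two stability criteria named in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes the Robust z statistic in its common form, z_i = (d_i − median(d)) / (0.74 × IQR(d)), computed over the between-form item-difficulty differences d_i, with a 1.645 critical value, and a 0.3-logit flag applied to the raw differences. The function names and the example data are hypothetical.

```python
import statistics

def robust_z(diffs):
    """Robust z for each between-form difficulty difference d_i:
    z_i = (d_i - median(d)) / (0.74 * IQR(d))."""
    med = statistics.median(diffs)
    quartiles = statistics.quantiles(diffs, n=4)   # [Q1, median, Q3]
    iqr = quartiles[2] - quartiles[0]
    return [(d - med) / (0.74 * iqr) for d in diffs]

def flag_items(old_b, new_b, z_crit=1.645, logit_crit=0.3):
    """Flag potentially unstable common items under both criteria.
    old_b, new_b: item-difficulty (b) estimates for the same items
    on the two adjacent forms. Returns (robust_z_flags, logit_flags)."""
    diffs = [n - o for o, n in zip(old_b, new_b)]
    zs = robust_z(diffs)
    robust_flags = [abs(z) > z_crit for z in zs]
    logit_flags = [abs(d) > logit_crit for d in diffs]
    return robust_flags, logit_flags

# Hypothetical b-estimates for six common items on two adjacent forms;
# the fourth item drifts by ~0.9 logits and is flagged by both criteria.
old_b = [0.0, 0.5, 1.0, 1.5, 2.0, -0.5]
new_b = [0.1, 0.62, 1.08, 2.4, 2.11, -0.41]
robust_flags, logit_flags = flag_items(old_b, new_b)
```

Note that in a vertical-scaling context the differences d_i are expected to be systematically nonzero (growth), so in practice the criteria are applied after linking, to the residual drift rather than the raw shift; the sketch above only illustrates the flagging arithmetic.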
Similar Resources
Investigating the Impact of Response Format on the Performance of Grammar Tests: Selected and Constructed
When constructing a test, an initial decision is choosing an appropriate item response format, which can be classified as selected or constructed. In large-scale tests where time and finance are of concern, the use of selected-response formats, known as multiple-choice items, is quite widespread. This study aimed at investigating the impact of response format on the performance of structure tests. Concurren...
Investigating Gender and Major DIF in the Iranian National University Entrance Exam Using Multiple-Indicators Multiple-Causes Structural Equation Modelling
The generalizability aspect of construct validity, as proposed by Messick (1989), requires that a test measure the same trait across different samples from the same population. Differential item functioning (DIF) analysis is a key component in the fairness evaluation of educational tests. University entrance exam for the candidates who seek admission into master's English programs (MEUEE) at I...
Effect of Historical Buildings Representation in Cyberspace in Creating Tourists' Destination Image (Qualitative Study of Traditional Accommodations in Kashan)
Introduction: Understanding the representation components of the historical buildings in cyberspace and their impact on the mental image of the tourists is a significant fact in tourism recognition and management. A part of this subject is the impact of place representation on the destination image of the tourist. In this research, the destination is traditional accommodations that attract tour...
Design and Psychometrics of an Assessment Tool for University Characteristics as a Learning Organization from the Perspective of Educational Leaders
Introduction: Universities as learning organizations are places for transcendence, teaching, research and offering knowledge. The aim of this study was to design and assess psychometric properties of an assessment tool for university characteristics as a learning organization from the perspective of educational leaders. Methods: This mixed methods research was performed on faculty members of Te...
Reconstructing, Investigating the Reliability and Validity, and Scoring of the Stanford Diagnostic Reading Test
Objectives: The aim of the present study was to reconstruct, determine the validity of, and score the Stanford Diagnostic Reading Test, Fourth Edition (SDRT4), in sixth-grade students. Methods: The population of the study was all sixth graders of the 19 educational districts of Tehran; 571 students (255 boys and 316 girls) were selected using random multi-cluster sampling. The data were an...
Publication date: 2011